GVPT Maths Camp

Data Visualisation

Learning objectives for today

  1. Introduction to R

  2. Create your first plot in R

  3. Test your hypotheses using informative data visualizations

The Research Process

Source: R4DS

R basics

R code:

1 + 2
[1] 3

Functions:

sum(1, 2)
[1] 3

EXERCISE

  1. Open up RStudio.
  2. Using the console, find the sum of 45, 978, and 121.
  3. What is 67 divided by 6?
  4. What is the square root of 894?

CHECK YOUR ANSWERS

Using the console, find the sum of 45, 978, and 121.

sum(45, 978, 121)
[1] 1144

Or:

45 + 978 + 121
[1] 1144

What is 67 divided by 6?

67 / 6
[1] 11.16667

What is the square root of 894?

sqrt(894)
[1] 29.89983

R packages

Packages are collections of R functions, data, and compiled code in a well-defined format.

# Install the relevant package(s)
install.packages("tidyverse")

# Load the packages in the current session
library(tidyverse)
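
Running install.packages() in every session downloads packages that are already installed. A minimal sketch of one way to avoid that, assuming a hypothetical helper named install_if_missing() (it is not part of R or the tidyverse) and the CRAN cloud mirror as the default repository:

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  # requireNamespace() returns FALSE, rather than an error, for absent packages
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs, repos = repos)
  }
  # Return the names of the packages that had to be installed, without printing
  invisible(missing_pkgs)
}
```

Calling install_if_missing("tidyverse") and then library(tidyverse) would cover both steps, and the install is skipped when the package is already there.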

EXERCISE

  1. Open up RStudio.
  2. Using the console, install the tidyverse packages.
install.packages("tidyverse")
  3. Load these packages in your current session.
library(tidyverse)

Your first GitHub repository

Version control and collaboration

  • We will use GitHub to share and manage our scripts.

  • Your GVPT622 homework assignments will be submitted via GitHub.

  • A great tool for collaboration and dissemination of research.

EXERCISE

Follow along as we:

  1. Create a repository in GitHub for this course
  2. Link your new repository to a new RStudio project

RStudio Projects

For your sanity’s sake, for your co-author’s sanity’s sake


Keeps everything:

  • Organised

  • Reproducible

  • Sustainable

Data visualisation

From R4DS - Data Visualization:

Do cars with big engines use more fuel than cars with small engines?

Skipping to the end

How did we do this?

library(ggplot2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon by class",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    color = "Class"
  )

Load relevant packages and data

# Load the relevant packages
library(tidyverse)

# Print the data (mpg ships with ggplot2, which the tidyverse loads)
mpg
manufacturer  model  displ  year  cyl
audi          a4     1.8    1999  4
audi          a4     1.8    1999  4
audi          a4     2.0    2008  4
audi          a4     2.0    2008  4
audi          a4     2.8    1999  6
audi          a4     2.8    1999  6
audi a4 2.8 1999 6

EXERCISE


Learn more about this data set by typing ?mpg into your console.

Plot your data

library(ggplot2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

EXERCISE

What do you see when you run the following?

ggplot(data = mpg)

How many rows are in mpg? How many columns?

nrow(mpg)
ncol(mpg)

What does the drv variable describe?

?mpg
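
Both counts can also be read off in a single call. A short sketch, assuming ggplot2 is installed so that its bundled mpg data set is available:

```r
library(ggplot2)

# dim() returns the number of rows, then the number of columns
dim(mpg)

# List the distinct codes stored in drv; ?mpg explains what each one means
sort(unique(mpg$drv))
```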

EXERCISE

Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cyl, y = hwy))

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))
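
One way to see why this plot is uninformative is to compare the number of distinct class and drv pairs with the number of cars: many observations land on exactly the same point. The sketch below, which assumes ggplot2 is installed, counts the pairs in base R and then uses geom_count() to size each point by the number of cars it represents. The object names are illustrative.

```r
library(ggplot2)

# Number of distinct class/drv combinations versus number of cars
n_pairs <- nrow(unique(mpg[, c("class", "drv")]))
n_cars <- nrow(mpg)
c(pairs = n_pairs, cars = n_cars)

# geom_count() maps the number of overlapping observations to point size
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
p_count
```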

Let’s look at groups in the data

  • We are not restricted to looking at only two interesting elements of our data.

  • You can use visual elements or aesthetics (aes) to communicate many dimensions in your data.

  • Let’s look at a categorical variable: the class of car (SUV, 2 seater, pickup, etc.).

  • Look for meaningfully defined groups.

Let’s look at groups in the data

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Let’s look at groups in the data

# Note: ggplot2 warns that using size for a discrete variable is not advised
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

Let’s look at groups in the data

# Note: only six shapes are used by default, so the seventh class (suv) is not drawn
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

Flexible visualization

You can use visual elements to communicate your findings in engaging ways.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class == "2seater"))
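
The expression class == "2seater" is evaluated for every row, so color ends up mapped to a logical variable with two values. A small sketch of what that expression returns, assuming ggplot2 is installed for the mpg data (is_two_seater is an illustrative name):

```r
library(ggplot2)

# TRUE for two-seaters, FALSE for every other class
is_two_seater <- mpg$class == "2seater"

# How many cars fall into each of the two groups
table(is_two_seater)
```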

Changing the look of your plots

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "red")

EXERCISE

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
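
To check your answer, the sketch below (assuming ggplot2 is installed) builds the plot both ways and uses layer_data() to inspect the colours ggplot2 actually draws. Inside aes(), "blue" is treated as a variable with a single value and is given a default colour by the scale; outside aes(), it is used as the literal colour. The object names p_mapped and p_set are illustrative.

```r
library(ggplot2)

# "blue" inside aes(): mapped like a variable with a single value
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes(): set directly as the colour of every point
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Colours computed for the first layer of each plot
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```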

EXERCISE

Which variables in mpg are categorical? Which variables are continuous?


Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?


What happens if you map the same variable to multiple aesthetics?

Let’s add useful headings

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "red") + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )

Let’s clean this up

Less is more when it comes to data visualization.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue") + 
  theme_minimal() + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )

Let’s clean this up

Creating your own theme

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Relationship between engine displacement and highway miles per gallon by class",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    color = "Class"
  )
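
If several plots should share this look, the theme() call can be stored once and added to each plot. A minimal sketch, assuming ggplot2 is installed; the name theme_camp is made up for illustration:

```r
library(ggplot2)

# Store the customisations in a reusable theme object
theme_camp <- theme(
  legend.position = "bottom",
  panel.grid = element_blank(),
  panel.background = element_blank(),
  plot.title.position = "plot",
  plot.title = element_text(face = "bold")
)

# Add the stored theme to any plot, just like a built-in theme
p_themed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  theme_camp
p_themed
```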

EXERCISE

  1. Create a scatterplot of hwy vs displ, and map a categorical variable from the mpg data set to a third aesthetic such as color.

  2. Customize your plot using the theme() function.

EXERCISE

  1. What happens if you facet on a continuous variable?

  2. What do the empty cells in the previous plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
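
The faceted plot these questions refer to is not reproduced in this text. A minimal sketch of such a plot, assuming it facets the displ and hwy scatterplot used throughout, followed by a cross-tabulation showing which drv and cyl combinations contain no cars:

```r
library(ggplot2)

# One panel per combination of drv (rows) and cyl (columns)
p_facets <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
p_facets

# Number of cars per drv/cyl combination; zeros correspond to empty panels
table(drv = mpg$drv, cyl = mpg$cyl)
```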

EXERCISE

Create the following plots. What does . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Summarising relationships in your data

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
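
With method = "lm", geom_smooth() draws a straight line fitted by ordinary least squares. As a hedged sketch, the same kind of line can be fitted directly with lm() from base R to read off its intercept and slope. It assumes ggplot2 is installed for the mpg data, and the object name fit is illustrative.

```r
library(ggplot2)

# Regress highway mileage on engine displacement
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted line
coef(fit)
```

A negative slope on displ would indicate that cars with bigger engines tend to travel fewer highway miles per gallon.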

Group-specific relationships

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_minimal()

Saving our work with Git and GitHub

Let’s wrap up! We need to:

  1. Save our work locally,

  2. Get our files ready to save to GitHub,

  3. Add a helpful message for our future selves letting us know what we did,

  4. Save our work to GitHub.

EXERCISE

To wrap up, let’s push our work to GitHub:

  1. Save your new script.

  2. Head over to the Git tab in the same area as our Environment tab.

  3. Stage your new file.

  4. Hit the commit button.

EXERCISE

  1. Add a useful message in the message area.

  2. Hit the commit button.

  3. Push your work to GitHub.

  4. Head over to your class repository on GitHub. You should see your file.

Summary

Today you:

  1. Set up your data science tools

  2. Plotted complex data in an engaging way

  3. Discovered interesting relationships in the data

  4. Connected these relationships or trends to your expectations (or hypotheses about the data)

  5. Created and used your first ever GitHub repository.